
PCT

INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)



(51) International Patent Classification:
G10L 3/02

A1

(11) International Publication Number: WO 99/66494

(43) International Publication Date: 23 December 1999 (23.12.99)

(21) International Application Number: PCT/US99/12804
(22) International Filing Date: 16 June 1999 (16.06.99)

(30) Priority Data:
09/099,952    19 June 1998 (19.06.98)    US



(71) Applicant: COMSAT CORPORATION [US/US]; 6560 Rock
Spring Drive, Bethesda, MD 20817 (US).

(72) Inventors: HO, Grant, Ian; 84 North Hills Terrace, Don Mills,
Ontario M3C 1M6 (CA). BARANIECKI, Marion; 4781
Bmdon Court, Fairfax, VA 22032 (US). YELDENER,
Suat; 19606 Crystal Rock Drive #14, Germantown, MD
20874 (US).

(74) Agents: CUSHING, David, J. et al.; Sughrue, Mion, Zinn,
Macpeak & Seas, PLLC, Suite 800, 2100 Pennsylvania
Avenue, N.W., Washington, DC 20037-3202 (US).

(81) Designated States: AU, CA, IN, European patent (AT, BE,
CH, CY, DE, DK, ES, FI, FR, GB, GR, IE, IT, LU, MC,
NL, PT, SE).

Published
With international search report.
Before the expiration of the time limit for amending the
claims and to be republished in the event of the receipt of
amendments.



(54) Title: IMPROVED LOST FRAME RECOVERY TECHNIQUES FOR PARAMETRIC, LPC-BASED SPEECH CODING SYSTEMS



[Figure: block diagram of the decoder. Inputs: LSP parameters, adaptive
codebook parameters, fixed codebook parameters. Blocks: LSP decoder, LSP
interpolator, excitation decoder, pitch decoder, pitch postfilter, LPC
synthesis filter, formant postfilter, gain scaling unit. Output.]



(57) Abstract 

A lost frame recovery technique for LPC-based systems employs interpolation of
parameters from previous and subsequent good frames, selective attenuation of
frame energy when the energy of a subframe exceeds a threshold, and energy
tapering in the presence of multiple successive lost frames.

IMPROVED LOST FRAME RECOVERY TECHNIQUES FOR
PARAMETRIC, LPC-BASED SPEECH CODING SYSTEMS

Background of the Invention

The transmission of compressed speech over packet-switching and mobile
communications networks involves two major systems. The source speech system
encodes the speech signal on a frame by frame basis, packetizes the compressed
speech into bytes of information, or packets, and sends these packets over the
network. Upon reaching the destination speech system, the bytes of information
are unpacketized into frames and decoded. The G.723.1 dual rate speech coder,
described in ITU-T Recommendation G.723.1, "Dual Rate Speech Coder for
Multimedia Communications Transmitting at 5.3 and 6.3 kbit/s," March 1996
(hereafter "Reference 1", and incorporated herein by reference) was ratified by
the ITU-T in 1996 and has since been used to add voice over various
packet-switching as well as mobile communications networks. With a mean opinion
score of 3.98 out of 5.0 (see Thryft, A. R., "Voice over IP Looms for Intranets
in '98," Electronic Engineering Times, August, 1997, Issue: 967, pp. 79, 102,
hereafter "Reference 2", and incorporated herein by reference), the near toll
quality of the G.723.1 standard is ideal for real-time multimedia applications
over private and local area networks (LANs) where packet loss is minimal.
However, over wide area networks (WANs), global area networks (GANs), and
mobile communications networks, congestion can be severe, and packet loss may
result in heavily degraded speech if left untreated. It is therefore necessary
to develop techniques to reconstruct lost speech frames at the receiver in
order to minimize distortion and maintain output intelligibility.

The following discussion of the G.723.1 dual rate coder and its error
concealment will assist in a full understanding of the invention.

The G.723.1 dual rate speech coder encodes 16-bit linear pulse-code
modulated (PCM) speech, sampled at a rate of 8 kHz, using linear predictive
analysis-by-synthesis coding. The excitation for the high rate coder is
Multipulse Maximum Likelihood Quantization (MP-MLQ) while the excitation for
the low rate coder is Algebraic-Code-Excited Linear-Prediction (ACELP). The
encoder operates on a 30
[...]
the average of the gains for subframes 2 and 3 of the previous frame. Otherwise,
for the voiced case, the previous frame is attenuated by 2.5 dB and regenerated
with a periodic excitation having a period equal to the estimated pitch lag. If
packet losses continue for the next two frames, the regenerated excitation is
attenuated by an additional 2.5 dB for each frame, but after three interpolated
frames, the output is completely muted, as described in Reference 1.
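
The attenuation schedule described above for the voiced case can be sketched as follows. This is an illustration only, not code from Reference 1: the function name g7231_voiced_scale is hypothetical, and it is assumed that an attenuation given in dB is applied to the excitation as an amplitude scale factor of 10^(-dB/20) and that muting starts with the fourth consecutive lost frame.

```python
def g7231_voiced_scale(consecutive_losses: int) -> float:
    """Amplitude scale for the n-th consecutive lost voiced frame (n >= 1).

    Hypothetical helper: 2.5 dB for the first lost frame, an additional
    2.5 dB for each of the next two, and complete muting afterwards.
    """
    if consecutive_losses < 1:
        raise ValueError("consecutive_losses must be at least 1")
    if consecutive_losses > 3:
        return 0.0  # output is muted after three concealed frames
    attenuation_db = 2.5 * consecutive_losses
    # Assumption: dB attenuation is converted to an amplitude factor.
    return 10.0 ** (-attenuation_db / 20.0)


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, round(g7231_voiced_scale(n), 4))
```

Under these assumptions the output drops abruptly to silence at the fourth lost frame, which is the behaviour the energy tapering technique is intended to avoid.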

The G.723.1 error concealment strategy was tested by sending various speech
segments over a network with packet loss levels of 1%, 3%, 6%, 10%, and 15%.
Single as well as multiple packet losses were simulated for each level. Through
a series of informal listening tests, it was shown that although the overall
output quality was very good for lower levels of packet loss, a number of
problems persisted at all levels and became increasingly severe as packet loss
increased.

First, parts of the output segment sounded unnatural and contained many
annoying, metallic-sounding artifacts. The unnatural sounding quality of the
output can be attributed to LSP vector recovery based on a fixed predictor as
previously described. Since the missing frame's LSP vector is recovered by
applying a fixed predictor to the previous frame's LSP vector, the spectral
changes between the previous and reconstructed frames are not smooth. As a
result of the failure to generate smooth spectral changes across missing
frames, unnatural sounding output quality occurs, which increases
unintelligibility during high levels of packet loss. In addition, many
high-frequency, metallic-sounding artifacts were heard in the output. These
metallic-sounding artifacts primarily occur in unvoiced regions of the output,
and are caused by incorrect voicing estimation of the previous frame during
excitation recovery. In other words, since a missing, unvoiced frame may
incorrectly be classified as voiced, the transition into the missing frame will
generate a high-frequency glitch, or metallic-sounding artifact, by applying
the estimated pitch lag computed for the previous frame. As packet loss
increases, this problem becomes even more severe, as incorrect voicing
estimation generates increased distortion.

Another problem using G.723.1 error concealment was the presence of
high-energy spikes in the output. These high-energy spikes, which are especially
[...]
by applying the second part of the linear interpolation technique, almost all
unwanted metallic-sounding artifacts are effectively masked away.

To eliminate the effects of high-energy spikes, a selective energy attenuation
technique was developed. This technique checks the signal energy for every
synthesized subframe against a threshold value, and attenuates all signal
energies for the entire frame to an acceptable level if the threshold is
exceeded. Combined with linear interpolation, this selective energy attenuation
technique effectively eliminates all instances of high-energy spikes from the
output.
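
A minimal sketch of the selective energy attenuation check is given below. It is illustrative only: the function name attenuate_frame, the use of the sum of squared samples as the subframe energy, and the choice of scaling the frame so that its loudest subframe lands exactly on the threshold are assumptions, since the text above states only that the frame is attenuated "to an acceptable level".

```python
import math
from typing import List, Sequence


def attenuate_frame(subframes: Sequence[Sequence[float]],
                    threshold: float) -> List[List[float]]:
    """Scale the whole frame down if any subframe energy exceeds the threshold."""
    if threshold <= 0.0:
        raise ValueError("threshold must be positive")
    # Assumed energy measure: sum of squared samples per synthesized subframe.
    energies = [sum(x * x for x in sf) for sf in subframes]
    peak = max(energies)  # requires at least one subframe
    if peak <= threshold:
        # No spike detected: the frame is returned unchanged.
        return [list(sf) for sf in subframes]
    # Assumption: the loudest subframe is brought down to the threshold,
    # and the same scale is applied to every subframe of the frame.
    scale = math.sqrt(threshold / peak)
    return [[x * scale for x in sf] for sf in subframes]


if __name__ == "__main__":
    frame = [[1.0, 1.0], [3.0, 4.0]]  # second subframe has energy 25
    print(attenuate_frame(frame, threshold=4.0))
```

Applying one scale factor to the entire frame, and not only to the offending subframe, avoids introducing a new discontinuity at the subframe boundaries.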

Finally, an energy tapering technique was designed to eliminate the effects of
"choppy" speech. Whenever multiple packets are lost in excess of one frame, this
technique simply repeats the previous good frame for every missing frame while
gradually decreasing the repeated frame's signal energy. By employing this
technique, the energy of the output signal is gradually smoothed or tapered over
multiple packet losses, thus eliminating any patches of silence or a "choppy"
speech effect evident in G.723.1 error concealment. Another advantage of energy
tapering is the relatively small amount of computation time required for
reconstructing lost packets. Compared to G.723.1 error concealment, since this
technique only involves gradual attenuation of the signal energies for repeated
frames, as opposed to performing G.723.1 fixed LSP prediction and excitation
recovery, the total algorithmic delay is considerably less.

Brief Description of the Drawing 

The invention will be more clearly understood from the following description
in conjunction with the accompanying drawings wherein:

Fig. 1 is a block diagram showing G.723.1 decoder operation;

Fig. 2 is a block diagram illustrating the use of Future, Ready and Copy buffers
in the interpolation technique according to the present invention;

Figs. 3a-3c are waveforms illustrating the elimination of high energy spikes by
the error concealment technique of the present invention; and
[...]
current frame is a good or missing frame that is currently being processed by
the decoder, and is stored in the Ready Buffer.

future frame is a good or missing frame immediately following the current
frame, and is stored in the Future Buffer.

Linear interpolation is a multi-step procedure that operates as follows: 

1. The Ready Buffer stores the current good frame to be processed while 
the Future Buffer stores the future frame of the encoded speech sequence. A 
copy of the current frame's speech model parameters is made and stored in the 
Copy Buffer.

2. The status of the future frame, either good or missing, is determined. If
the future frame is good, no linear interpolation is necessary, and the linear
interpolation flag is reset to 0. If the future frame is missing, linear
interpolation might be necessary, and the linear interpolation flag is
temporarily set to 1. (In a real-time system, a missing frame is detected by
either a receiver timeout or Cyclical Redundancy Check (CRC) failure. These
missing frame detection algorithms, however, are not part of the invention, but
must be recognized and incorporated at the decoder for proper operation of any
packet reconstruction strategy.)

3. The current frame is decoded and synthesized. A copy of the current
frame's LPC synthesis filter and pitch postfiltered excitation is made.

4. The future frame, originally in the Future Buffer, becomes the current
frame and is stored in the Ready Buffer. The next frame in the encoded speech
sequence arrives as the future frame in the Future Buffer.

5. The value of the linear interpolation flag is checked. If the flag is set to
0, the process jumps back to step (1). If the flag is set to 1, the process jumps
to step (6).

6. The status of the future frame is determined. If the future frame is
good, linear interpolation is applied; the linear interpolation flag remains set to
[...]
13. The future frame, originally in the Future Buffer, becomes the current
frame and is stored in the Ready Buffer. The next frame in the encoded speech
sequence arrives as the future frame in the Future Buffer. The process then
returns to step (1).
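
The one-frame lookahead provided by the Ready and Future Buffers can be sketched as a simple planning function that labels each frame of a received sequence. This is a rough illustration under stated assumptions, not the procedure of the numbered steps itself: the function name plan_concealment and the action labels are hypothetical, the sequence is assumed to begin with a good frame, a frame past the end of the sequence is treated as missing, and once energy tapering has begun for a run of lost frames it is assumed to continue to the end of that run.

```python
from typing import List


def plan_concealment(good: List[bool]) -> List[str]:
    """Label each frame 'decode', 'interpolate', or 'taper'.

    good[i] is True when frame i arrived intact. The decision for a
    missing frame looks only at the next frame (the Future Buffer).
    """
    actions: List[str] = []
    for i, is_good in enumerate(good):
        if is_good:
            actions.append("decode")
            continue
        # Status of the future frame; end of sequence counts as missing (assumption).
        future_good = i + 1 < len(good) and good[i + 1]
        # Assumption: a run of losses that started with tapering keeps tapering.
        in_taper_run = bool(actions) and actions[-1] == "taper"
        if future_good and not in_taper_run:
            actions.append("interpolate")  # single loss between two good frames
        else:
            actions.append("taper")  # multiple successive losses
    return actions


if __name__ == "__main__":
    print(plan_concealment([True, False, True, False, False, True]))
```

A single lost frame bracketed by good frames is interpolated, while every frame of a longer run of losses is recovered by repeating the previous good frame with reduced energy.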

There are at least two important advantages of linear interpolation over
G.723.1 error concealment. The first advantage occurs in step (7), during LSP
recovery. In step (7), since linear interpolation determines the missing frame's
LSP parameters based on the previous and future frames, this provides a better
estimate for the missing frame's LSP parameters, thereby enabling smoother
spectral changes across the missing frame, than if fixed LSP prediction were
simply used, as in G.723.1 error concealment. As a result, more natural
sounding, intelligible speech is generated, thereby increasing comfortability
for the listener.
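
As an illustration of recovering a missing frame's LSP vector from its neighbours, the sketch below interpolates element by element between the previous and future frames. The function name interpolate_lsp and the default equal weighting of the two frames are assumptions made for the example; the weights actually used in step (7) are not restated here.

```python
from typing import List, Sequence


def interpolate_lsp(prev_lsp: Sequence[float],
                    next_lsp: Sequence[float],
                    weight: float = 0.5) -> List[float]:
    """Linearly interpolate an LSP vector between two good frames.

    weight = 0.0 returns the previous frame's vector, weight = 1.0 the
    future frame's vector; 0.5 (an assumed default) is the midpoint.
    """
    if len(prev_lsp) != len(next_lsp):
        raise ValueError("LSP vectors must have the same order")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [(1.0 - weight) * p + weight * n
            for p, n in zip(prev_lsp, next_lsp)]


if __name__ == "__main__":
    # Toy two-element vectors; a real LPC order would have more elements.
    print(interpolate_lsp([0.10, 0.30], [0.30, 0.50]))
```

Because each interpolated element lies between its two neighbours, ascending input vectors yield an ascending result for any weight in the stated range.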

The second advantage of linear interpolation occurs in steps (8) to (11), during
excitation recovery. First, in step (8), since linear interpolation generates
the missing frame's gain parameters by averaging the fixed codebook gains
between the previous and future frames, it provides a better estimate for the
missing frame's gain, as opposed to the technique described in G.723.1 error
concealment. This interpolated gain, which is then applied for unvoiced frames
in step (10), thereby generates smoother, more comfortable sounding gain
transitions across frame erasures. Secondly, in step (11), voicing
classification is based on both the predictor gain and estimated pitch lag, as
opposed to the predictor gain alone, as in G.723.1 error concealment. That is,
frames whose predictor gain is greater than 0.58 dB are also compared against a
threshold pitch lag, Pthresh. Since unvoiced frames are primarily composed of
high-frequency spectra, those frames that have low estimated pitch lags, and
hence, high estimated pitch frequencies, thereby have a higher probability of
being unvoiced. Thus, frames whose estimated pitch lags fall below Pthresh are
declared unvoiced and those whose estimated pitch lags exceed Pthresh are
declared voiced. In sum, by selectively determining a frame's voicing
classification based on both the predictor gain and estimated pitch lag, the
technique of this invention effectively masks away all occurrences of
high-frequency, metallic-sounding artifacts
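
The two-stage voicing decision can be sketched as follows. The function name classify_voicing is hypothetical, the value of Pthresh is not given in this passage and must be supplied by the caller, and the handling of values exactly equal to either threshold (treated here as unvoiced) is an assumption.

```python
def classify_voicing(predictor_gain_db: float,
                     pitch_lag: int,
                     p_thresh: int) -> str:
    """Classify a frame as 'voiced' or 'unvoiced'.

    Stage 1: the predictor gain must exceed 0.58 dB.
    Stage 2: the estimated pitch lag must also exceed p_thresh.
    """
    if predictor_gain_db <= 0.58:
        return "unvoiced"  # fails the predictor gain test
    # A short pitch lag implies a high pitch frequency, so the frame is
    # more likely to be unvoiced.
    return "voiced" if pitch_lag > p_thresh else "unvoiced"


if __name__ == "__main__":
    # p_thresh=40 is an arbitrary illustrative value, not taken from the text.
    print(classify_voicing(1.2, pitch_lag=80, p_thresh=40))
    print(classify_voicing(1.2, pitch_lag=30, p_thresh=40))
```

The second stage only ever reclassifies frames from voiced to unvoiced, so it targets precisely the false-voiced decisions that give rise to metallic-sounding artifacts.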
[...]
copy of the current frame's speech model parameters is made and stored in the
Copy Buffer.

2. The status of the future frame, either good or missing, is determined. If
the future frame is good, no linear interpolation is necessary; the linear
interpolation flag is reset to 0. If the future frame is missing, linear
interpolation might be necessary; the linear interpolation flag is temporarily
set to 1.

3. The current frame is decoded and synthesized. A copy of the current 
frame's LPC synthesis filter and pitch postfiltered excitation is made. 

4. The future frame, originally in the Future Buffer, becomes the current
frame and is stored in the Ready Buffer. The next frame in the encoded speech 
sequence arrives as the future frame in the Future Buffer. 

5. The value of the linear interpolation flag is checked. If the flag is set to 
0, the process jumps back to step (1). If the flag is set to 1, the process jumps 
to step (6). 

6. The status of the future frame is determined. If the future frame is 
good, linear interpolation is applied as described in subsection 3.1. If the 
future frame is missing, energy tapering is applied; the energy tapering flag is 
set to 1, the linear interpolation flag is reset to 0, and the process jumps to step 
(7). 

7. The copy of the previous frame's pitch postfiltered excitation, from 
step (3), is attenuated by (0.5 x value of energy tapering flag) dB. 

8. The copy of the previous frame's LPC synthesis filter, from step (3), is
used to synthesize the current frame using the attenuated excitation in step (7).

9. The future frame, originally in the Future Buffer, becomes the current
frame and is stored in the Ready Buffer. The next frame in the encoded speech
sequence arrives as the future frame in the Future Buffer.

10. The current frame is synthesized using steps (7) to (9), and the process then jumps to
step (11). 
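
The attenuation applied in step (7) can be sketched as below. The function name taper_excitation is hypothetical. It is assumed that the energy tapering flag increases by one for each further lost frame in the run, that the attenuation is always applied to the stored copy from step (3) and not to an already attenuated frame, and that a dB value is converted to an amplitude factor of 10^(-dB/20).

```python
from typing import List, Sequence


def taper_excitation(excitation_copy: Sequence[float],
                     taper_flag: int) -> List[float]:
    """Attenuate the stored excitation by (0.5 x taper_flag) dB."""
    if taper_flag < 1:
        raise ValueError("taper_flag must be at least 1")
    attenuation_db = 0.5 * taper_flag
    # Assumption: dB attenuation is converted to an amplitude factor.
    scale = 10.0 ** (-attenuation_db / 20.0)
    return [x * scale for x in excitation_copy]


if __name__ == "__main__":
    stored = [1.0, -2.0, 0.5]  # toy excitation copied from the last good frame
    for flag in (1, 2, 3):  # assumed: flag grows by one per lost frame
        print(flag, [round(x, 4) for x in taper_excitation(stored, flag)])
```

With 0.5 dB per lost frame the decay is much gentler than the 2.5 dB steps and hard mute of Reference 1, which is consistent with the stated aim of avoiding patches of silence.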
[...]
more natural sounding speech and effective masking away of all metallic-sounding
artifacts were achieved due to smoother spectral transitions across missing
frames based on linear interpolation and improved voicing classification.
Secondly, all high-energy spikes were eliminated due to selective energy
attenuation and linear interpolation. Finally, all instances of "choppy" speech
were eliminated due to energy tapering. It is important to realize that as
network congestion levels increase, the amount of packet loss also increases.
Thus, in order to maintain real-time speech intelligibility, it is essential to
develop techniques to successfully conceal frame erasures while minimizing the
amount of degradation at the output. The strategies developed by the authors
represent techniques which provide improved output speech quality, are more
robust in the presence of frame erasures compared to the techniques described in
Reference 1, and can be easily applied with any parametric, LPC-based speech
coder over any packet-switching or mobile communications network.

It will be appreciated that various changes and modifications may be made to
the specific embodiments described above without departing from the spirit and
scope of the invention as defined in the appended claims.
[...]
5. A method according to claim 1, wherein on loss of multiple successive
frames, said method comprises the step of repeating the encoded signals for a
frame immediately preceding said multiple successive frames while gradually
reducing the signal energy for each recovered frame.



6. A method according to claim 2, wherein said encoded signals include 
said LSP parameters, fixed codebook gains and further excitation signals, said method 
comprising interpolating said fixed codebook gain of said lost frame from the fixed 
codebook gains of said first and second frames, and adopting said further excitation 
signals from said first frame as the further excitation signals of said lost frame. 

7. A method of recovering a lost frame in a system of the type wherein
information is transmitted as successive frames of encoded signals and the
information is reconstructed from said encoded signals at a receiver, said
method comprising:

calculating an estimated pitch value and predictor gain for a first frame
prior to said lost frame; and

classifying said lost frame as voiced or unvoiced in accordance with 
said predictor gain and estimated pitch value from said first frame. 

8. A method of recovering a lost frame in a system of the type wherein
information is transmitted as successive frames of encoded signals, each frame
including plural subframes, and the information is reconstructed from said
encoded signals at a receiver, said method comprising:

comparing a signal energy for each subframe of a particular frame
against a threshold; and

attenuating signal energies for all subframes in said particular frame if
the signal energy in any subframe exceeds said threshold.